class: center, middle, inverse, title-slide .title[ #
Introduction to Big Data
Lesson 1.2 Computer science survival kit
] .author[ ### Arthur Katossky & Rémi Pépin ] .date[ ### Tuesday, March 29, 2023 ] --- # Computer science survival kit -- A computer can be abstracted by four key components: -- - processing (FR: capacité de calcul) -- - memory (FR: mémoire vive) -- - storage (FR: stockage) -- - wiring / network (FR: réseau) ??? Ask audience --- background-image: url(data:image/png;base64,#img/processors/cpu.jpg) background-size: cover --- ## Processors ??? The processor is the unit responsible for performing the actual computing. A processor can be called a "core", a "chip"... and can be differentiated into CPU (central processing units), GPU (graphical processing units) and recently TPU (tensor processing units).[^1] [1]: Formally, the chip is the physical device, and a chip may enclose one or more logical processors, called cores, in its electronic circuits. Multiple CPU cores on the same die share circuit elements, use less energy, exchange information faster (individual signals can be shorter, degrade slower and do not need to be repeated as often). A processor performance is measured in how many operations it can perform per second, typically measured in FLOP (floating-point operations per second). --- ## Processors **What is important for us to know?** -- - computation happens **_physically_** in **transistors** -- - **programs must be converted to machine-code**, i.e. to a list of formatted instructions that the processor can execute -- - **instructions are stored sequentially** in one or several stacks / threads / queues -- - one program / application may access several treads (**multi-threading**) -- - **processors can be specialized** (e.g. GPU) -- - performance is measured in number **operations** or **instructions per second** (FLOPS, IPS) ??? - the elementary computation (addition, multiplication...) happens through a **physical** process - in order to perform higher-level operations (the source code), instructions must be converted to more basal units of computation (the machine code) - the low-level instructions are stored sequentially in a stack or thread - processors draw from the stack ; multiple processors may all draw from the same stack (multi-tread architecture), or on the contrary have each their own (single-thread) - processors can be hard-wired to specialize into doing some types of computation, this is the case of GPUs and TPUs ---  .footnote[**Source:** Kurzweil ([link](http://www.kurzweilai.net/exponential-growth-of-computing)) via Our World in Data ([link](https://ourworldindata.org/technological-progress))] ??? Processors have made incredible progress over time. And this keeps happening. --- .height600[] --- .height600[] ??? "Law" by 1965 by the Intel co-founder Gordon E. Moore. --- ## Processors **Processing is a limitting factor.** - processors are **the most expensive part** of the hardware at buy time - energy consumption of the processors are the **main cost of computation** - one core can **only** perform **so many operations** per second - it is **non-trivial** to **coordinate multiple processors** on the same complex task ??? You can not just "buy more cores" to tackle a problem that takes too long. --- background-image: url(data:image/png;base64,#img/ram.jpg) background-size: cover --- ## Memory ??? Memory is the unit where the computer temporarily stores the instructions and the data on which to perform computation. It is also broadly called "cache" or "RAM"[^2]. Memory efficiency is measured by how much information it can contain and how fast in can be read from or written to. --- ## Memory **What is important for us to know ?** -- - moving data around takes time -- - memory is **fast** (ns) -- - memory is **volatile** (lost in case of power interruption) -- - memory-processor units are heavily **optimized** -- - processors always access the the hard disk through **memory caching** ??? What is important to know for us: - memory is fast, with access time typically measured in nanoseconds - memory is volatile, meaning that it is lost in case of power interruption - the easiest way to perform a computation is to move all data in memory — where it becomes accessible to the CPU --- .height500[ Source : https://aiimpacts.org/trends-in-dram-price-per-gigabyte/ ] --- ## Memory **Memory is a limitting factor.** - it is **non trivial** to work with data that **can't fit into memory** - memory is **the second-most expensive part** of a computer at buy time - memory is **shared** with other programs on the computer ¹ .footnote[ ¹ <small>Or with other users in case of shared server.</small> ] ??? We can readily perform computation only on a chunk of data that can fit into memory. RAM is very costly: it is still expensive to buy RAM but it is also expensive at use time since it needs electricity to persist data. <!-- graph of the veolution of memory size --> [2]: Formally, there is intermediate memory between RAM and the processor: the CPU register is closest to the processor, then comes the CPU cache, on which the computer moves data and instructions that need to for efficiency reasons. "Caching" in general refers to the act of copying a piece of information closer to the place it will be used, such as when storing a local copy of website while navigating over the Internet. --- background-image: url(data:image/png;base64,#img/storage/storage.png) background-size: contain --- ## Storage - for long-term storage of information -- - can have **multiple forms**: transistor plates (USB sticks, SSD), magnetic disks (hard disk, floppy disk), physical engraving (vinyls, CDs, DVDs), magnetic tape, paper (books, punch cards, bar codes, QR codes), biological (DNA), etc. -- - commonly referred to as "(disk / storage) space" or "(hard) disk" (even when no disk is involved) -- - **non-volatile**, contrary to memory -- - valuable properties: - **size** - **integrity**, resistance to degradation - **speed**, in read and write ??? Storage is the unit for long-term storage of information. It is also called "(disk) space" or "(hard) disk". Contrary to memory, the disk's non-volatile nature make it suitable to conserving information over long periods. Its first measure of efficiency is thus its size, how long it can retain information without error and how fast you can read from and write on it. There are other meda of storage, such as... paper, but also digital tapes. In reality the distinction memory vs. storage is more of a continuum. For instance, SSD disks are faster to read but more expensive to produce. --- ## Storage Storage is evaluated in **bytes** (B) or **octets** (o) and their mutiples: | Number of bytes | Symbol | Full name | Order of magnitude | |---------|----|------------|------------------| | `\(10^3\)` | ko | kilobyte | a vectorial icon | | `\(10^6\)` | Mo | megabyte | a high resolution image, a book | | `\(10^9\)` | Go | gigabyte | a high-resolution video | `\(10^{12}\)` | To | terabyte | the whole Friends series in high-resolution | | `\(10^{15}\)` | Po | petabyte | Spotify database | | `\(10^{18}\)` | Eo | exabyte | monthly Internet traffic | .footnote[ Since `\(2^{10}=1024 \simeq 10^3\)` and since storage is essentially binary, you often find the convention `\(1 \text{Mo} = 2^{10} \text{ko}\)`. The official symbols `\(\text{Kio}\)`, `\(\text{MiB}\)`... never really caught on. ] --- .height600[] --- .height600[ <a href="https://www.statista.com/statistics/795748/worldwide-seagate-average-hard-disk-drive-capacity/" rel="nofollow"><img src="data:image/png;base64,#https://www.statista.com/graphic/1/795748/worldwide-seagate-average-hard-disk-drive-capacity.jpg" alt="Statistic: Average capacity of Seagate hard disk drives (HDDs) worldwide from FY2015 to FY2021, by quarter (in terabyte per drive) | Statista"/></a> ] --- ## Storage **What is important for us to know ?** -- - storage is **faillible** -- - writing and reading is **slow** (ms) -- - **data is always cached** (copied / charged into memory) for computation ??? What is important to know for us: - storage is faillible: errors do occur and information slowly degrades (phenomenon known as data decay: https://en.wikipedia.org/wiki/Data_degradation ; solid state memory like flash card may be corrupted by current ; magnetic field can corrupt disks, tapes ; scratches and dust can corrupt DVDs ; rotting / overuse can corrupt books etc. ; keeping copies often mitigates the problem, but at the cost of maintaining coherence!) - storage is slow, typically measured in milliseconds [^3] - computation cannot be performed directly from disk, it has to be temporarily copied into memory (or "cached") [3]: writing and reading on a disk requires the disk to turn and the reading head to move ; on the top of that, files are split into chunks that are stored at different places and the computer needs a registry to record where what is where --- ## Storage **Storage is a limiting factor.** - on a given computer, **you can only store so much** -- - dealing with files **distributed** over multiple computers is **non-trivial** -- </br>Cost and speed are usually not an issue. ??? On a given computer, we can only store so much. Today's price of disk space does make it less limiting economically. But if data exceeds the size of the disk, data must be split between several disks, which complicates programming (**distributed storage**). <!-- average storage capacity ; graph of information degradation over time --> ??? 2 more things before we move on: network and compilation / intepretation --- ## Network - information transfer is **time-consuming** - usually not an issue on a personal computer: - processors and memory are **closely integrated** in the same circuits - physical connection between memory and disk is **short** and **fast** - becomes an issue with **remote** (or **distributed**) storage (or computing) --- ## Network For transferring volumes above 1 Po (1024 To) to their servers, Amazon actually sends a truck, which is faster than a fast Internet connection (https://aws.amazon.com/fr/snowmobile). .height400[] --- ## IN THE NEXT SECTION Each of the basic components of a computer can cause issues. - disk space is a limiting factor -> consider **distributed storage** - even when data is relatively small, computation can be a challenge - memory size and number of processors are a limiting factor -> consider **distributed computing** But distribution is **non-trivial** (future lesson) and often not needed. Always consider first: - consider simple **good practices** (next section) - consider commercial **remote**, **cloud-based services** (future lesson)